Universal Learning Theory 



Marcus Hutter 

RSISE @ ANU and SML @ NICTA 
Canberra, ACT, 0200, Australia 
marcus@hutter1.net   www.hutter1.net

February 2011 

Abstract 

This encyclopedic article gives a mini-introduction to the theory of universal
learning, founded by Ray Solomonoff in the 1960s and significantly
developed and extended in the last decade. It explains the spirit of universal
learning, but necessarily glosses over technical subtleties.

1 Definition, Motivation and Background



Universal (machine) learning is concerned with the development and study of algo- 
rithms that are able to learn from data in a very large range of environments with as 
few assumptions as possible. The class of environments typically considered includes 
all computable stochastic processes. The investigated learning tasks range from
inductive inference, sequence prediction, and sequential decisions to (re)active problems
like reinforcement learning [Hut05], but also include clustering, regression, and
others [LV08]. Despite various no-free-lunch theorems [WM97], universal learning is
possible by assuming that the data possess some effective structure, but without 
specifying any further, which structure. Learning algorithms that are universal (at 
least to some degree) are also necessary for developing autonomous general intelli- 
gent systems, required e.g. for exploring other planets, as opposed to decision sup- 
port systems which keep a human in the loop. There is also an intrinsic interest in 
striving for generality: Finding new learning algorithms for every particular (new) 
problem is possible but cumbersome and prone to disagreement or contradiction. 
A sound formal general and ideally complete theory of learning can unify existing 
approaches, guide the development of practical learning algorithms, and last but 
not least lead to novel and deep insights. 

This encyclopedic article gives a mini-introduction to the theory of universal
learning, founded by Ray Solomonoff in the 1960s [Sol64, Sol78] and significantly
developed and extended by the author and his colleagues in the last decade. It is
based on [Hut05]. It explains the spirit of universal learning, but necessarily glosses
over many technical subtleties. Precise formulation of the results with proofs and/or
references to original publications can be found in [Hut05].

2 Deterministic Environments 

Let $t,n\in\mathbb{N}$ be natural numbers, $\mathcal{X}^*$ be the set of finite strings and
$\mathcal{X}^\infty$ be the set of infinite sequences over some alphabet $\mathcal{X}$ of size
$|\mathcal{X}|$. For a string $x\in\mathcal{X}^*$ of length $\ell(x)=n$ we write $x_1x_2...x_n$
with $x_t\in\mathcal{X}$, and further abbreviate $x_{t:n} := x_tx_{t+1}...x_{n-1}x_n$
and $x_{<n} := x_1...x_{n-1}$, and $\epsilon = x_{<1}$ for the empty string. Consider a countable
class of hypotheses $\mathcal{M} = \{H_1,H_2,...\}$. Each hypothesis $H\in\mathcal{M}$ (also called
model) shall describe an infinite sequence $x^H_{1:\infty}$, e.g. like in IQ test questions
"2,4,6,8,...". In online learning, for $t=1,2,3,...$, we predict $x_t$ based on past
observations $\dot{x}_{<t}$, then nature reveals $\dot{x}_t$, and so on, where the dot above x
indicates the true observation. We assume that the true hypothesis is in $\mathcal{M}$, i.e.
$\dot{x}_{1:\infty} = x^{H_m}_{1:\infty}$ for some $m\in\mathbb{N}$. The goal is to ("quickly")
identify the unknown $H_m$ from the observations.

Learning by enumeration works as follows: Let
$\mathcal{M}_t = \{H\in\mathcal{M} : x^H_{<t} = \dot{x}_{<t}\}$ be the set of hypotheses
consistent with our observations $\dot{x}_{<t}$ so far. The hypothesis in $\mathcal{M}_t$ with
smallest index, say $m'_t$, is selected and used for predicting $x_t$. Then $\dot{x}_t$
is observed and all $H\in\mathcal{M}_t$ inconsistent with $\dot{x}_t$ are eliminated, i.e. they
are not included in $\mathcal{M}_{t+1}$. Every prediction error results in the elimination of at
least $H_{m'_t}$, so after at most $m-1$ errors, the true hypothesis $H_m$ gets selected
forever, since it never makes an error ($H_m\in\mathcal{M}_t\ \forall t$). This identification
may take arbitrarily long (in t), but the number of errors on the way is bounded by $m-1$,
and the latter is often more important. As an example for which the bound is attained,
consider $H_i$ with $x^{H_i}_{1:\infty} := 1^{f(i)}0^\infty\ \forall i$ for any strictly
increasing function f, e.g. $f(i)=i$. But we now show that we can do much better than this,
at least for finite $\mathcal{X}$.
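
The enumeration scheme can be sketched in a few lines. This is a minimal illustration, not
part of the original exposition: it assumes the toy class $H_i = 1^i0^\infty$ from the example
with $f(i)=i$, and the function names are hypothetical.

```python
def hypothesis(i, t):
    """Symbol x_t (t is 1-based) of the toy hypothesis H_i = 1^i 0^infinity."""
    return 1 if t <= i else 0

def learn_by_enumeration(true_index, steps):
    """Predict with the smallest-index consistent hypothesis; return the error count."""
    candidate = 1          # smallest index not yet eliminated
    errors = 0
    for t in range(1, steps + 1):
        prediction = hypothesis(candidate, t)
        observed = hypothesis(true_index, t)
        if prediction != observed:
            errors += 1
        # Move on to the smallest-index hypothesis consistent with x_{1:t}.
        # The true hypothesis is always consistent, so this loop terminates.
        while any(hypothesis(candidate, s) != hypothesis(true_index, s)
                  for s in range(1, t + 1)):
            candidate += 1
    return errors
```

On this class the learner errs once for every hypothesis in front of the true one, so the
bound $m-1$ is met with equality.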

Majority learning. Consider (temporarily in this paragraph only) a binary alphabet
$\mathcal{X}=\{0,1\}$ and a finite deterministic hypothesis class
$\mathcal{M}=\{H_1,H_2,...,H_N\}$. $H_m$ and $\mathcal{M}_t$ are as before, but now we take a
majority vote among the hypotheses in $\mathcal{M}_t$ as our prediction of $x_t$. If the
prediction turns out to be wrong, then at least half (the majority) of the hypotheses get
eliminated from $\mathcal{M}_t$. Hence after at most $\log_2 N$ errors, there is only a single
hypothesis, namely $H_m$, left over. So this majority predictor makes at most $\log_2 N$
errors. As an example where this bound is essentially attained, consider $m=N=2^n-1$ and
let $x^{H_i}_{1:\infty}$ be the digits after the comma of the binary expansion of
$(i-1)/2^n$ for $i=1,...,N$.
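
A small simulation of the majority predictor on the example class above may help. It is a
sketch under stated assumptions: ties in the vote are broken towards 0 (the text does not fix
a tie rule), and the helper names are hypothetical.

```python
def binary_digit(i, n, t):
    """t-th digit (1-based) after the binary point of (i-1)/2**n; zero beyond position n."""
    if t > n:
        return 0
    return ((i - 1) >> (n - t)) & 1

def majority_learning(n, true_index, steps):
    """Majority vote over H_1..H_N with N = 2**n - 1; return the number of errors."""
    consistent = list(range(1, 2 ** n))
    errors = 0
    for t in range(1, steps + 1):
        ones = sum(binary_digit(i, n, t) for i in consistent)
        prediction = 1 if 2 * ones > len(consistent) else 0   # ties go to 0 (assumption)
        observed = binary_digit(true_index, n, t)
        if prediction != observed:
            errors += 1
        # Keep only the hypotheses that agree with the revealed symbol.
        consistent = [i for i in consistent if binary_digit(i, n, t) == observed]
    return errors
```

Each error removes at least half of the surviving hypotheses, which is why the error count
cannot exceed $\log_2 N$.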

Weighted majority for countable classes. Majority learning can be adapted to
denumerable classes $\mathcal{M}$ and general finite alphabet $\mathcal{X}$ as follows: Each
hypothesis $H_i$ is assigned a weight $w_i>0$ with $\sum_i w_i\le 1$. Let
$W := \sum_{i:H_i\in\mathcal{M}_t} w_i$ be the total weight of the hypotheses in $\mathcal{M}_t$.
Let $\mathcal{M}^a_t := \{H_i\in\mathcal{M}_t : x^{H_i}_t = a\}$ be the consistent
hypotheses predicting $x_t=a$, and $W_a$ their weight, and take the weighted majority
prediction $x_t = \arg\max_a W_a$. Similarly as above, a prediction error decreases W by
at least a factor of $1-1/|\mathcal{X}|$, since $\max_a W_a \ge W/|\mathcal{X}|$. Since
$w_m \le W \le 1$, this algorithm can make at most
$\log_{1-1/|\mathcal{X}|} w_m = O(\log w_m^{-1})$ prediction errors. If we choose for instance
$w_i=(i+1)^{-2}$, the number of errors is $O(\log m)$, which is an exponential improvement
over the Gold-style learning by enumeration above.
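
The following sketch runs weighted majority with $w_i=(i+1)^{-2}$ and compares the error count
with the bound $\log_{1-1/|\mathcal{X}|} w_m$. Two things are assumptions made only for
illustration: the countable class is truncated to finitely many hypotheses, and the toy class
$H_i = 1^i0^\infty$ is used.

```python
import math

def hypothesis(i, t):
    """H_i predicts 1 for t <= i and 0 afterwards (toy class, an assumption)."""
    return 1 if t <= i else 0

def weighted_majority(true_index, steps, class_size=1024):
    """Weighted majority with w_i = (i+1)**-2 over the truncated class H_1..H_class_size."""
    weights = {i: (i + 1) ** -2 for i in range(1, class_size + 1)}
    errors = 0
    for t in range(1, steps + 1):
        vote = {0: 0.0, 1: 0.0}
        for i, w in weights.items():
            vote[hypothesis(i, t)] += w          # W_a: weight predicting symbol a
        prediction = max(vote, key=vote.get)     # argmax_a W_a
        observed = hypothesis(true_index, t)
        if prediction != observed:
            errors += 1
        weights = {i: w for i, w in weights.items() if hypothesis(i, t) == observed}
    return errors

def error_bound(true_index, alphabet_size=2):
    """log_{1-1/|X|} w_m for w_m = (m+1)**-2, the bound stated in the text."""
    w_m = (true_index + 1) ** -2
    return math.log(w_m) / math.log(1 - 1 / alphabet_size)
```

For the same true index, learning by enumeration on this class needs $m-1$ errors, while the
bound here grows only like $2\log_2(m+1)$.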

3 Algorithmic Probability 

Algorithmic probability was founded by Solomonoff [Sol64]. The so-called universal
probability or a-priori probability is the key quantity for universal learning.
Its philosophical and technical roots are Ockham's razor (choose the simplest model
consistent with the data), Epicurus' principle of multiple explanations (keep all 
explanations consistent with the data), (Universal) Turing machines (to compute, 
quantify and assign codes to all quantities of interest), and Kolmogorov complexity 
(to define what simplicity/complexity means). This section considers determinis- 
tic computable sequences, and the next section the general setup of computable 
probability distributions. 

(Universal) monotone Turing machines. Since we consider infinite computable
sequences, we need devices that convert input data streams to output data streams.
For this we define the following variant of a classical deterministic Turing machine:
A monotone Turing machine T is defined as a Turing machine with one unidirectional
input tape, one unidirectional output tape, and some bidirectional work tapes. The
input tape is binary (no blank) and read only, the output tape is over finite alphabet
$\mathcal{X}$ (no blank) and write only, unidirectional tapes are those where the head can only
move from left to right, work tapes are initially filled with zeros and the output
tape with some fixed element from $\mathcal{X}$. We say that monotone Turing machine T
outputs/computes a string starting with x on input p, and write $T(p)=x*$, if p is
to the left of the input head when the last bit of x is output (T reads all of p but
no more). T may continue operation and need not halt. For a given x, the set
of such p forms a prefix code. Such codes are called minimal programs. Similarly
we write $T(p)=\omega$ if p outputs the infinite sequence $\omega$. A prefix code $\mathcal{P}$
is a set of binary strings such that no element is a proper prefix of another. It satisfies
Kraft's inequality $\sum_{p\in\mathcal{P}} 2^{-\ell(p)} \le 1$.
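
As a quick check of the last two definitions, the sketch below tests the prefix property and
evaluates the Kraft sum exactly. The function names are hypothetical and the codewords are
plain strings of '0' and '1'.

```python
from fractions import Fraction

def is_prefix_code(codewords):
    """True if no codeword is a prefix of another (duplicates count as violations)."""
    words = sorted(codewords)
    # After sorting, any prefix violation shows up between neighbouring words.
    return all(not b.startswith(a) for a, b in zip(words, words[1:]))

def kraft_sum(codewords):
    """Exact value of sum_p 2**(-len(p))."""
    return sum(Fraction(1, 2 ** len(p)) for p in codewords)
```

For example, `["0", "10", "110", "111"]` is a prefix code with Kraft sum exactly 1, whereas
`["0", "1", "00"]` has Kraft sum 5/4 and therefore cannot be a prefix code.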

The table of rules of a Turing machine T can be prefix encoded in a canonical way
as a binary string, denoted by $\langle T\rangle$. Hence, the set of Turing machines
$\{T_1,T_2,...\}$ can be effectively enumerated. There are so-called universal Turing
machines that can "simulate" all other Turing machines. We define a particular one which
simulates monotone Turing machine $T(q)$ if fed with input $\langle T\rangle q$, i.e.
$U(\langle T\rangle q) = T(q)\ \forall T,q$. Note that for p not of the form
$\langle T\rangle q$, $U(p)$ does not output anything. We call this particular U the
reference universal Turing machine.

Universal weighted majority learning. $T_1(\epsilon),T_2(\epsilon),...$ constitutes an
effective enumeration of all finite and infinite computable sequences, hence so does monotone
$U(p)$ for $p\in\{0,1\}^*$. As argued below, the class of computable infinite sequences is
conceptually very interesting. The halting problem implies that there is no recursive
enumeration of all partial-recursive functions with infinite domain; hence we cannot
remove the finite sequences algorithmically. It is very fortunate that we don't have
to. Hypothesis $H_p$ is identified with the sequence $U(p)$, which may be finite, infinite,
or possibly even empty. The class of considered hypotheses is
$\mathcal{M} := \{H_p : p\in\{0,1\}^*\}$.

The weighted majority algorithm also needs weights $w_p$ for each $H_p$. Ockham's
razor combined with Epicurus' principle demands assigning a high (low) prior weight
to a simple (complex) hypothesis. If complexity is identified with program length,
then $w_p$ should be a decreasing function of $\ell(p)$. It turns out that
$w_p = 2^{-\ell(p)}$ is the "right" choice, since minimal p form a prefix code and therefore
$\sum_p w_p \le 1$ as required.

Using $H_p$ for prediction can now fail in two ways. $H_p$ may make a wrong
prediction or no prediction at all for $x_t$. The true hypothesis (computed by some
program p) is still assumed to produce an infinite sequence. The weighted majority
algorithm in this setting makes at most $O(\log w_p^{-1}) = O(\ell(p))$ errors. It is also
plausible that learning $\ell(p)$ bits requires $O(\ell(p))$ "trials".

Universal mixture prediction. Solomonoff [Sol78] defined the following universal
a-priori probability

    $$M(x) := \sum_{p:U(p)=x*} 2^{-\ell(p)}$$    (1)

That is, $M(x)=W$ is the total weight of the computable deterministic hypotheses
consistent with x for the universal weight choice $w_p = 2^{-\ell(p)}$. The universal weighted
majority algorithm predicted $\arg\max_a M(\dot{x}_{<t}a)$. Instead, one could also make a
probability prediction $M(a|\dot{x}_{<t}) := M(\dot{x}_{<t}a)/M(\dot{x}_{<t})$, which is the
relative weight of hypotheses in $\mathcal{M}_t$ predicting a. The higher the probability
$M(\dot{x}_t|\dot{x}_{<t})$ assigned to the true next observation $\dot{x}_t$, the better.
Consider the absolute prediction error $|1-M(\dot{x}_t|\dot{x}_{<t})|$ and the logarithmic
error $-\log M(\dot{x}_t|\dot{x}_{<t})$. The cumulative logarithmic error is bounded by
$\sum_{t=1}^n -\log M(\dot{x}_t|\dot{x}_{<t}) = -\log M(\dot{x}_{1:n}) \le \ell(p)$ for any
program p that prints a string starting with $\dot{x}_{1:n}$. For instance p could be chosen
as the shortest one printing $\dot{x}_{1:\infty}$, which has length
$Km(\dot{x}_{1:\infty}) := \min\{\ell(p) : U(p)=\dot{x}_{1:\infty}\}$. Using
$1-z \le -\log z$ and letting $n\to\infty$ we get

    $$\sum_{t=1}^\infty |1-M(\dot{x}_t|\dot{x}_{<t})| \le \sum_{t=1}^\infty -\log M(\dot{x}_t|\dot{x}_{<t}) \le Km(\dot{x}_{1:\infty})$$

Hence again, the cumulative absolute and logarithmic errors are bounded by the 
number of bits required to describe the true environment. 
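
The logarithmic-error bound can be observed on a toy stand-in for M. The real M is
incomputable, so the sketch below replaces the universal machine by an assumed, hypothetical
class: periodic binary sequences, where a pattern of length k is "programmed" by k ones, a
zero, and the pattern itself (a prefix code of length $2k+1$).

```python
import math
from itertools import product

def toy_hypotheses(max_period):
    """Yield (code_length, pattern) for all periodic sequences up to max_period."""
    for k in range(1, max_period + 1):
        for pattern in product((0, 1), repeat=k):
            yield 2 * k + 1, pattern

def mixture_log_loss(true_pattern, steps, max_period=4):
    """Cumulative -log2 of the mixture's predictive probability of the true symbols."""
    alive = list(toy_hypotheses(max_period))     # hypotheses consistent so far
    total = 0.0
    for t in range(steps):
        observed = true_pattern[t % len(true_pattern)]
        weight_before = sum(2.0 ** -length for length, _ in alive)
        alive = [(length, p) for length, p in alive if p[t % len(p)] == observed]
        weight_after = sum(2.0 ** -length for length, _ in alive)
        # Predictive probability = relative weight of hypotheses predicting `observed`.
        total += -math.log2(weight_after / weight_before)
    return total
```

The cumulative loss telescopes to the log-ratio of initial and surviving weight, so it stays
below the code length $2k+1$ of the true pattern however many symbols are predicted.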



4 Universal Bayes 

The exposition so far has dealt with deterministic environments only. Data sequences
produced by real-world processes are rarely as clean as IQ test sequences.
They are often noisy. This section deals with stochastic sequences sampled from
computable probability distributions. The developed theory can be regarded as
an instantiation of Bayesian learning. Bayes' theorem allows one to update beliefs in
the face of new information but is mute about how to choose the prior and the model
class to begin with. Subjective choices based on prior knowledge are informal, and
traditional 'objective' choices like Jeffreys' prior are not universal. Machine learning,
the computer science branch of statistics, develops (fully) automatic inference
and decision algorithms for very large problems. Naturally, machine learning has
(re)discovered and exploited different principles (Ockham's and Epicurus') for choosing
priors, appropriate for this situation. This leads to an alternative representation
of universal probability as a mixture over all lower semi-computable semimeasures
with Kolmogorov complexity based prior as described below.

Bayes. Sequences $\omega = \omega_{1:\infty}\in\mathcal{X}^\infty$ are now assumed to be sampled
from the "true" probability measure $\mu$, i.e.
$\mu(x_{1:n}) := P[\omega_{1:n}=x_{1:n}|\mu]$ is the $\mu$-probability that $\omega$ starts
with $x_{1:n}$. Expectations w.r.t. $\mu$ are denoted by E. In particular for a function
$f:\mathcal{X}^n\to\mathbb{R}$, we have
$E[f] = E[f(\omega_{1:n})] = \sum_{x_{1:n}}\mu(x_{1:n})f(x_{1:n})$.
Note that in Bayesian learning, measures, environments, and models are the
same objects; let $\mathcal{M} = \{\nu_1,\nu_2,...\} \equiv \{H_{\nu_1},H_{\nu_2},...\}$ denote a
countable class of these measures=hypotheses. Assume that $\mu$ is unknown but known to be a
member of $\mathcal{M}$, and $w_\nu := P[H_\nu]$ is the given prior belief in $H_\nu$. Then the
Bayes mixture

    $$\xi(x_{1:n}) := P[\omega_{1:n}=x_{1:n}] = \sum_{\nu\in\mathcal{M}} w_\nu\,\nu(x_{1:n})$$

must be our a-priori belief in $x_{1:n}$, and
$P[H_\nu|\omega_{1:n}=x_{1:n}] = w_\nu\nu(x_{1:n})/\xi(x_{1:n})$ be our
posterior belief in $\nu$ by Bayes' rule.
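
The mixture and the posterior are easy to compute for a small class. The sketch below
assumes, purely for illustration, a class of i.i.d. Bernoulli measures on binary strings
indexed by their parameter theta; the function names are hypothetical.

```python
def likelihood(theta, xs):
    """nu_theta(x_{1:n}) for an i.i.d. Bernoulli(theta) measure on binary strings."""
    p = 1.0
    for x in xs:
        p *= theta if x == 1 else 1.0 - theta
    return p

def bayes_mixture(prior, xs):
    """xi(x_{1:n}) = sum_nu w_nu * nu(x_{1:n}); `prior` maps theta -> w_theta."""
    return sum(w * likelihood(theta, xs) for theta, w in prior.items())

def posterior(prior, xs):
    """P[H_nu | x_{1:n}] = w_nu * nu(x_{1:n}) / xi(x_{1:n})."""
    xi = bayes_mixture(prior, xs)
    return {theta: w * likelihood(theta, xs) / xi for theta, w in prior.items()}
```

With the prior `{0.2: 0.5, 0.8: 0.5}` and data `[1, 1, 1, 0, 1]`, nearly all posterior mass
moves to theta = 0.8, and the predictive probability `bayes_mixture(prior, xs + [1]) /
bayes_mixture(prior, xs)` lies between the two parameters.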

Universal choice of $\mathcal{M}$. Next we need to find a universal class of environments
$\mathcal{M}_U$. Roughly speaking, Bayes works if $\mathcal{M}$ contains the true environment
$\mu$. The larger $\mathcal{M}$, the less restrictive is this assumption. The class of all
computable distributions, although only countable, is pretty large from a practical point of
view, since it includes for instance all of today's valid physics theories. (Finding a
non-computable physical system would indeed overturn the generally accepted Church-Turing
thesis.) It is the largest class relevant from a computational point of view. Solomonoff
[Sol64, Eq.(13)] defined and studied the mixture over this class.

One problem is that this class is not (effectively=recursively) enumerable, since
the class of computable functions is not enumerable due to the halting problem, nor
is it decidable whether a function is a measure. Hence $\xi$ is completely incomputable.
Leonid Levin [ZL70] had the idea to "slightly" extend the class and include also lower
semi-computable semimeasures.

A function $\nu:\mathcal{X}^*\to[0,1]$ is called a semimeasure iff
$\nu(x)\ge\sum_{a\in\mathcal{X}}\nu(xa)\ \forall x\in\mathcal{X}^*$.
It is a proper probability measure iff equality holds and $\nu(\epsilon)=1$. $\nu(x)$ still
denotes the $\nu$-probability that a sequence starts with string x. A function is called lower
semi-computable if it can be approximated from below. Similarly to the fact that
the class of partial recursive functions is recursively enumerable, one can show that
the class $\mathcal{M}_U = \{\nu_1,\nu_2,...\}$ of lower semi-computable semimeasures is
recursively enumerable. In some sense $\mathcal{M}_U$ is the largest class of environments for
which $\xi$ is in some sense computable, but even larger classes are possible [Sch02].

Kolmogorov complexity. Before we can turn to the prior $w_\nu$, we need to quantify
complexity/simplicity. Intuitively, a string is simple if it can be described in a few
words, like "the string of one million ones", and is complex if there is no such short
description, like for a random object whose shortest description is specifying it bit
by bit. We are interested in effective descriptions, and hence restrict decoders to be
Turing machines. One can define the prefix Kolmogorov complexity of string x as
the length $\ell$ of the shortest halting program p for which U outputs x:

    $$K(x) := \min_p\{\ell(p) : U(p) = x \text{ halts}\}$$

Simple strings like 000...0 can be generated by short programs and hence have low
Kolmogorov complexity, but irregular (e.g. random) strings are their own shortest
description and hence have high Kolmogorov complexity. For non-string objects o
(like numbers and functions) one defines $K(o) := K(\langle o\rangle)$, where
$\langle o\rangle\in\mathcal{X}^*$ is some standard code for o. In particular,
$K(\nu_i)=K(i)$.

To be brief, K is an excellent universal complexity measure, suitable for quantifying
Ockham's razor.
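
K itself is incomputable, but the contrast between simple and random strings can be seen
with any real compressor. The sketch below uses zlib as an assumed stand-in: the compressed
length bounds K(x) from above only up to an additive constant for the decompressor, and the
helper names are hypothetical.

```python
import random
import zlib

def compressed_bits(data: bytes) -> int:
    """Length in bits of the zlib encoding of `data`: a crude, computable proxy for K."""
    return 8 * len(zlib.compress(data, 9))

def pseudo_random_bytes(n: int, seed: int = 0) -> bytes:
    """n bytes from a seeded generator; these only look random to zlib."""
    rng = random.Random(seed)
    return bytes(rng.getrandbits(8) for _ in range(n))
```

Comparing `compressed_bits(b"1" * 100_000)` with
`compressed_bits(pseudo_random_bytes(100_000))` shows the gap. The second string is in fact
simple in the sense of K, since a short program with a seed generates it. zlib cannot find
that description, which illustrates that practical compressors only give upper bounds.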

The universal prior. We can now quantify a prior biased towards simple models.
First, we quantify the complexity of an environment $\nu$ or hypothesis $H_\nu$ by its
Kolmogorov complexity $K(\nu)$. The universal prior should be a decreasing function
in the model's complexity, and of course sum to (less than) one. Since
$\sum_x 2^{-K(x)} \le 1$ by the prefix property and Kraft's inequality, this suggests the
choice

    $$w_\nu = w^U_\nu := 2^{-K(\nu)}$$    (2)

Since $\log i \le K(\nu_i) \le \log i + 2\log\log i$ for "most" i, most $\nu_i$ have prior
approximately reciprocal to their index i, as also advocated by Jeffreys and Rissanen.

Representations. Combining the universal class $\mathcal{M}_U$ with the universal prior
$w^U_\nu$, we arrive at the universal mixture

    $$\xi_U(x) := \sum_{\nu\in\mathcal{M}_U} 2^{-K(\nu)}\nu(x)$$    (3)

which has remarkable properties. First, it is itself a lower semi-computable semimeasure,
that is $\xi_U\in\mathcal{M}_U$, which is very convenient. Note that for most classes,
$\xi\notin\mathcal{M}$.

Second, $\xi_U$ coincides with M within an irrelevant multiplicative constant, and
$M\in\mathcal{M}_U$. This means that the mixture over deterministic computable sequences is
as rich as the mixture over the much larger class of semi-computable semimeasures.
The intuitive reason is that the probabilistic semimeasures are in the convex hull
of the deterministic ones, and therefore need not be taken extra into account in the
mixture.

There is another, possibly the simplest, representation: One can show that $M(x)$
is equal to the probability that U outputs a string starting with x when provided
with uniform random noise on the program tape. Note that a uniform distribution 
is also used in many no-free-lunch theorems to prove the impossibility of universal 
learners, but in our case the uniform distribution is piped through a universal Turing 
machine, which defeats these negative implications as we will see in the next section. 

5 Applications 

In the stochastic case, identification of the true hypothesis is problematic. The
posterior $P[H|x]$ may not concentrate around the true hypothesis $H_\mu$ if there are
other hypotheses $H_\nu$ that are not asymptotically distinguishable from $H_\mu$. But even
if model identification (induction in the narrow sense) fails, predictions, decisions,
and actions can be good, and indeed, for universal learning this is generally the case.

Universal sequence prediction. Given a sequence $x_1x_2...x_{t-1}$, we want to predict
its likely continuation $x_t$. We assume that the strings which have to be continued
are drawn from a computable "true" probability distribution $\mu$. The maximal prior
information a prediction algorithm can possess is the exact knowledge of $\mu$, but often
the true distribution is unknown. Instead, prediction is based on a guess $\rho$ of $\mu$. Let
$\rho(a|x) := \rho(xa)/\rho(x)$ be the "predictive" $\rho$-probability that the next symbol is
$a\in\mathcal{X}$, given sequence $x\in\mathcal{X}^*$. Since $\mu\in\mathcal{M}_U$ it is natural
to use $\xi_U$ or M for prediction.

Solomonoff's [Sol78, Hut05] celebrated result indeed shows that M converges to
$\mu$. For general alphabet it reads

    $$\sum_{t=1}^\infty E\Big[\sum_{a\in\mathcal{X}} \big(M(a|\omega_{<t})-\mu(a|\omega_{<t})\big)^2\Big] \le K(\mu)\ln 2 + O(1)$$    (4)

Analogous bounds hold for $\xi_U$ and for distances other than the Euclidean one, e.g. the
Hellinger and the absolute distance and the relative entropy.

For a sequence $a_1,a_2,...$ of random variables, $\sum_{t=1}^\infty E[a_t^2] \le c<\infty$
implies $a_t\to 0$ for $t\to\infty$ with $\mu$-probability 1 (w.p.1). Convergence is rapid in
the sense that the probability that $a_t^2$ exceeds $\varepsilon>0$ at more than
$c/(\varepsilon\delta)$ times is bounded by $\delta$. This might loosely be called the number
of errors. Hence Solomonoff's bound implies

    $$M(x_t|\omega_{<t}) - \mu(x_t|\omega_{<t}) \to 0 \text{ for any } x_t \text{, rapid w.p.1 for } t\to\infty$$

The number of times M deviates from $\mu$ by more than $\varepsilon>0$ is bounded by
$O(K(\mu))$, i.e. is proportional to the complexity of the environment, which is again
reasonable. A counting argument shows that $O(K(\mu))$ errors for most $\mu$ are unavoidable.
No other choice for $w_\nu$ would lead to significantly better bounds. Again, in general
it is not possible to determine when these "errors" occur. Multi-step lookahead
convergence $M(x_{t:n_t}|\omega_{<t}) - \mu(x_{t:n_t}|\omega_{<t}) \to 0$, even for unbounded
lookahead $n_t-t\ge 0$, relevant for delayed sequence prediction and in reactive
environments, can also be shown.

In summary, M is an excellent sequence predictor under the only assumption 
that the observed sequence is drawn from some (unknown) computable probability 
distribution. No ergodicity, stationarity, or identifiability or other assumption is 
required. 
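
For a finite class the analogue of bound (4) can be checked by brute force. The sketch below
assumes a small class of Bernoulli measures with prior weights $w_\nu$ and uses the
finite-class form of the bound, $\ln w_\mu^{-1}$, which equals $K(\mu)\ln 2$ when
$w_\mu = 2^{-K(\mu)}$. All names are hypothetical and the expectation is computed exactly by
enumerating every history.

```python
import math
from itertools import product

def bernoulli(theta):
    """Return nu_theta as a function on binary tuples (i.i.d. Bernoulli measure)."""
    return lambda xs: math.prod(theta if x else 1.0 - theta for x in xs)

def expected_squared_error(mu, prior, n):
    """sum_{t<=n} E_mu[ sum_a (xi(a|x_<t) - mu(a|x_<t))**2 ] by exact enumeration.

    `prior` is a list of (measure, weight) pairs defining the Bayes mixture xi.
    """
    def xi(xs):
        return sum(w * nu(xs) for nu, w in prior)

    total = 0.0
    for t in range(1, n + 1):
        for past in product((0, 1), repeat=t - 1):
            for a in (0, 1):
                diff = xi(past + (a,)) / xi(past) - mu(past + (a,)) / mu(past)
                total += mu(past) * diff ** 2
    return total
```

With three equally weighted measures the cumulative expected squared error stays below
$\ln 3$ for every horizon, which mirrors how (4) bounds an infinite sum by a constant.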

Universal sequential decisions. Predictions usually form the basis for decisions
and actions, which result in some profit or loss. Let $\ell_{x_ty_t}\in[0,1]$ be the received
loss for decision $y_t\in\mathcal{Y}$ when $x_t\in\mathcal{X}$ turns out to be the true t-th
symbol of the sequence. The $\rho$-optimal strategy

    $$y_t^{\Lambda_\rho}(\omega_{<t}) := \arg\min_{y_t}\sum_{x_t}\rho(x_t|\omega_{<t})\,\ell_{x_ty_t}$$

minimizes the $\rho$-expected loss. For instance, if we can decide among
$\mathcal{Y}$ = {sunglasses, umbrella} and it turns out to be $\mathcal{X}$ = {sun, rain}, and
our personal loss matrix is $\ell = \binom{0.0\ \ 0.1}{1.0\ \ 0.3}$ (rows x, columns y), then
$\Lambda_\rho$ takes $y_t^{\Lambda_\rho}$ = sunglasses if $\rho(\text{rain}|\omega_{<t}) < 1/8$
and an umbrella otherwise. For $\mathcal{X}=\mathcal{Y}$ and 0-1 loss $\ell_{xy}=0$ for $x=y$
and 1 else, $\Lambda_\rho$ predicts the most likely symbol
$y_t^{\Lambda_\rho} = \arg\max_a\rho(a|\omega_{<t})$ as in Section 2.
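
The sunglasses/umbrella decision can be written out directly. The loss values in this
sketch are an assumption chosen so that the switch-over happens at a rain probability of
1/8, as quoted in the text; the names are hypothetical.

```python
# LOSS[x][y]: loss when x happens and we decided y (illustrative values, an assumption).
LOSS = {
    "sun":  {"sunglasses": 0.0, "umbrella": 0.1},
    "rain": {"sunglasses": 1.0, "umbrella": 0.3},
}

def expected_loss(rho, decision, loss=LOSS):
    """sum_x rho(x) * loss[x][decision] for a predictive distribution rho."""
    return sum(rho[x] * loss[x][decision] for x in rho)

def optimal_decision(rho, loss=LOSS):
    """argmin_y of the rho-expected loss."""
    decisions = next(iter(loss.values())).keys()
    return min(decisions, key=lambda y: expected_loss(rho, y, loss))
```

With a rain probability of 0.1 the sunglasses win (expected loss 0.10 against 0.12), while
at 0.2 the umbrella is better (0.14 against 0.20).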
The cumulative $\mu$(=true)-expected loss of $\Lambda_\rho$ for the first n symbols is

    $$\text{Loss}_n^{\Lambda_\rho} := \sum_{t=1}^n E\big[\ell_{\omega_t y_t^{\Lambda_\rho}(\omega_{<t})}\big] = \sum_{t=1}^n\sum_{x_{1:t}}\mu(x_{1:t})\,\ell_{x_t y_t^{\Lambda_\rho}(x_{<t})}$$

If $\mu$ is known, $\Lambda_\mu$ obviously results in the best decisions in the sense of
achieving minimal expected loss among all strategies. For the predictor $\Lambda_M$ based on M
(and similarly $\xi_U$), one can show

    $$\sqrt{\text{Loss}_n^{\Lambda_M}} - \sqrt{\text{Loss}_n^{\Lambda_\mu}} \le \sqrt{2K(\mu)\ln 2} + O(1)$$    (5)

This implies that $\text{Loss}_n^{\Lambda_M}/\text{Loss}_n^{\Lambda_\mu}\to 1$ for
$\text{Loss}_n^{\Lambda_\mu}\to\infty$, or if $\text{Loss}_\infty^{\Lambda_\mu}$ is finite, then
also $\text{Loss}_\infty^{\Lambda_M}<\infty$. This shows that M (via $\Lambda_M$) also performs
excellently from a decision-theoretic perspective, i.e. suffers loss only slightly larger than
the optimal $\Lambda_\mu$ strategy.

One can also show that $\Lambda_M$ is Pareto-optimal (admissible) in the sense that every
other predictor with smaller loss than $\Lambda_M$ in some environment $\nu\in\mathcal{M}_U$
must be worse in another environment.

Universal classification and regression. The goal of classification and regression
is to infer the functional relationship $f:\mathcal{Y}\to\mathcal{X}$ from data
$\{(y_1,x_1),...,(y_n,x_n)\}$. In a predictive online setting one wants to "directly" infer
$x_t$ from $y_t$ given $(y_{<t},x_{<t})$ for $t=1,2,3,...$. The universal induction framework
has to be extended by regarding $y_{1:\infty}$ as independent side-information presented in
form of an oracle or extra tape information or extra parameter. The construction has to
ensure that $x_{1:n}$ only depends on $y_{1:n}$ but is (functionally or statistically)
independent of $y_{n+1:\infty}$.

First, we augment a monotone Turing machine with an extra input tape containing
$y_{1:\infty}$. The Turing machine is called chronological if it does not read beyond
$y_{1:n}$ before $x_{1:n}$ has been written. Second, semimeasures $\rho=\mu,\nu,M,\xi_U$ are
extended to $\rho(x_{1:n}|y_{1:\infty})$, i.e. one semimeasure $\rho(\cdot|y_{1:\infty})$ for
each $y_{1:\infty}$ (no distribution over y is assumed). Any such semimeasure must be
chronological in the sense that $\rho(x_{1:n}|y_{1:\infty})$ is independent of $y_t$ for
$t>n$, hence we can write $\rho(x_{1:n}|y_{1:n})$. In classification and regression, $\rho$ is
typically (conditionally) i.i.d., i.e. $\rho(x_{1:n}|y_{1:n}) = \prod_{t=1}^n\rho(x_t|y_t)$,
which is chronological, but note that the Bayes mixture $\xi$ is not i.i.d. One can show that
the class of lower semi-computable chronological semimeasures
$\mathcal{M}^{|}_U = \{\nu_1(\cdot|\cdot),\nu_2(\cdot|\cdot),...\}$ is effectively enumerable.

The generalized universal a-priori semimeasure also has two equivalent definitions
(equal up to a multiplicative constant, as in the unconditional case):

    $$M(x_{1:n}|y_{1:n}) := \sum_{p:U(p,y_{1:n})=x_{1:n}} 2^{-\ell(p)} = \sum_{\nu\in\mathcal{M}^{|}_U} 2^{-K(\nu)}\nu(x_{1:n}|y_{1:n})$$    (6)

which is again in $\mathcal{M}^{|}_U$. In case of $|\mathcal{Y}|=1$, this reduces to (1) and
(3). The bounds (4) and (5) and others continue to hold, now for all individual y's, i.e. M
predicts asymptotically $x_t$ from $y_t$ and $(y_{<t},x_{<t})$ for any y, provided x is
sampled from a computable probability measure $\mu(\cdot|y_{1:\infty})$. Convergence is rapid
if $\mu$ is not too complex.

Universal reinforcement learning. The generalized universal a-priori semimeasure
(6) can be used to construct a universal reinforcement learning agent, called
AIXI. In reinforcement learning, an agent interacts with an environment in cycles
$t=1,2,...,n$. In cycle t, the agent chooses an action $y_t$ (e.g. a limb movement) based on
past perceptions $x_{<t}$ and past actions $y_{<t}$. Thereafter, the agent perceives
$x_t\equiv o_tr_t$, which consists of a (regular) observation $o_t$ (e.g. a camera image) and
a real-valued reward $r_t$. The reward may be scarce, e.g. just +1 (-1) for winning (losing)
a chess game, and 0 at all other times. Then the next cycle t+1 starts. The goal of the agent
is to maximize its expected reward over its lifetime n. Probabilistic planning deals
with the situation in which the environmental probability distribution
$\mu(x_{1:n}|y_{1:n})$ is known. Reinforcement learning deals with the case of unknown $\mu$.
In universal reinforcement learning, the unknown $\mu$ is replaced by M similarly to the
prediction, decision, and classification cases above. The universally optimal action in
cycle t is [Hut05]

    $$y_t := \arg\max_{y_t}\sum_{x_t} ... \max_{y_n}\sum_{x_n} [r_t+...+r_n]\, M(x_{1:n}|y_{1:n})$$    (7)

The expectations ($\sum$) and maximizations (max) over future x and y are interleaved
in chronological order to form an expectimax tree similarly to minimax decision
trees in extensive zero-sum games like chess. Optimality and universality results
similar to the prediction case exist.
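
The interleaving of sums and maximizations in (7) is a plain recursion once a predictive
model is given. The sketch below is not AIXI: it assumes a tiny hypothetical class of two
Bernoulli bandits with a Bayes mixture in place of M, and percepts that consist of the reward
only. It uses the conditional (predictive) form of the mixture, which yields the same
expectimax values as the joint form in (7).

```python
ENVIRONMENTS = [{"a": 0.9, "b": 0.1}, {"a": 0.1, "b": 0.9}]  # hypothetical bandits
PRIOR = [0.5, 0.5]
ACTIONS = ("a", "b")

def mixture_model(history, action):
    """Predictive reward distribution of the Bayes mixture after `history`.

    `history` is a list of (action, reward) pairs with rewards in {0, 1}.
    """
    weights = []
    for env, w in zip(ENVIRONMENTS, PRIOR):
        for y, r in history:
            w *= env[y] if r == 1 else 1.0 - env[y]
        weights.append(w)                      # unnormalized posterior weight
    p_reward = sum(w * env[action] for env, w in zip(ENVIRONMENTS, weights)) / sum(weights)
    return {1: p_reward, 0: 1.0 - p_reward}

def expectimax(history, horizon, model=mixture_model):
    """Return (expected future reward sum, best action) for `horizon` more cycles."""
    if horizon == 0:
        return 0.0, None
    best_value, best_action = float("-inf"), None
    for y in ACTIONS:                          # max over the agent's action
        value = 0.0
        for r, prob in model(history, y).items():   # expectation over the percept
            future, _ = expectimax(history + [(y, r)], horizon - 1, model)
            value += prob * (r + future)
        if value > best_value:
            best_value, best_action = value, y
    return best_value, best_action
```

With a horizon of 2 the planner expects a reward sum of 1.32 rather than 1.0, because the
first pull reveals which arm is good and the second pull exploits that. The tree has
exponential size in the horizon, so this only works for toy problems.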

Approximations and practical applications. Since K and M are only semi-computable,
they have to be approximated in practice. For instance,
$-\log M(x) = K(x) + O(\log\ell(x))$, and $K(x)$ can be and has been approximated by
off-the-shelf compressors like Lempel-Ziv and successfully applied to a plethora of
clustering problems [CV05]. The approximations upper-bound $K(x)$ and e.g. for Lempel-Ziv
converge to $K(x)$ if x is sampled from a context tree source. The Minimum Description
Length principle [Gru07] also attempts to approximate $K(x)$ for stochastic x. The
Context Tree Weighting algorithm considers a relatively large subclass of $\mathcal{M}_U$ that
can be summed over efficiently. This can be and has been combined with Monte-Carlo
sampling to efficiently approximate AIXI (7) [VNHS10]. The time-bounded versions
of K and M, namely Levin complexity Kt and the speed prior S, have also been
applied to various learning tasks [Gag07].
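
One common way to turn a compressor into a similarity measure for clustering is the
normalized compression distance. The sketch below assumes zlib as the compressor; it is an
illustration of the idea rather than the exact procedure of the cited work, and the function
names are hypothetical.

```python
import zlib

def compressed_size(data: bytes) -> int:
    """Compressed length in bytes, used as a computable stand-in for K."""
    return len(zlib.compress(data, 9))

def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance: (C(xy) - min(C(x),C(y))) / max(C(x),C(y))."""
    cx, cy, cxy = compressed_size(x), compressed_size(y), compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)
```

Strings that share structure compress well together and get a small distance, while
unrelated strings end up near 1. A matrix of pairwise distances can then be passed to any
standard clustering method.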

Other applications. Continuously parameterized model classes are very common
in statistics. Bayesians usually assume a prior density over some parameter
$\theta\in\mathbb{R}^d$, which works fine for many problems, but has its problems. Even for
continuous classes $\mathcal{M}$, one can assign a (proper) universal prior (not density)
$w^U_\theta := 2^{-K(\theta)} > 0$ for computable $\theta$ (and $\nu_\theta$), and 0 for
uncomputable ones. This effectively reduces $\mathcal{M}$ to a discrete class
$\{\nu_\theta\in\mathcal{M} : w^U_\theta>0\}\subseteq\mathcal{M}_U$ which is typically dense in
$\mathcal{M}$. There are various fundamental philosophical and statistical problems and
paradoxes around (Bayesian) induction, which nicely disappear in the universal framework.
For instance, universal induction has no zero and no improper p(oste)rior problem, i.e.
can confirm universally quantified hypotheses, is reparametrization and representation
invariant, and avoids the old-evidence and updating problem, in contrast to
most classical continuous prior densities. It even performs well in incomputable
environments, actually better than the latter [Hut07].

6 Discussion and Future Directions 

Universal learning is designed to work for a wide range of problems without any a- 
priori knowledge. In practice we often have extra information about the problem at 
hand, which could and should be used to guide the forecasting. One can incorporate
it by explicating all our prior knowledge z and placing it on an extra input tape of
our universal Turing machine U, or by prefixing our observation sequence x by z and using
$M(zx)$ for prediction.

Another concern is the dependence of K and M on U. The good news is that a 
change of U changes K{x) only within an additive and M{x) within a multiplica- 
tive constant independent of x. This makes the theory practically immune to any 
"reasonable" choice of U for large data sets x, but predictions for short sequences 
(shorter than typical compiler lengths) can be arbitrary. One solution is to take
into account our (whole) scientific prior knowledge z [Hut06]: predicting the
now long string zx leads to good (less sensitive to "reasonable" U) predictions. This
is a kind of grand transfer learning scheme. It is unclear whether a more elegant 
theoretical solution is possible. 

Finally, the incomputability of K and M prevents a direct implementation of 
Solomonoff induction. Most fundamental theories have to be approximated for prac- 
tical use, sometimes systematically like polynomial time approximation algorithms 
or numerical integration, and sometimes heuristically like in many AI-search problems
or in non-convex optimization problems. Universal machine learning is similar,
except that its core quantities are only semi-computable. This makes them often 
hard, but as described in the previous section, not impossible, to approximate. 

In any case, universal induction can serve as a "gold standard" which practition- 
ers can aim at. Solomonoff's theory considers the class of all computable (stochas- 
tic) models, and a universal prior inspired by Ockham and Epicurus, quantified by 
Kolmogorov complexity. This leads to a universal theory of induction, prediction,
decisions, and, by including Bellman, to universal actions in reactive environments. 
Future progress on the issues above (incorporating prior knowledge, getting rid of 
the compiler constants, and finding better approximations) will lead to new insights 
and will continually increase the number of applications. 